A large-scale assessment of the quality of plant genome assemblies using the LTR assembly index

Abstract Recent advances in genome sequencing have led to an increase in the number of sequenced genomes. However, the presence of repetitive sequences complicates the assembly of plant genomes. The LTR assembly index (LAI) has recently been widely used to assess the quality of genome assembly, as a higher LAI is associated with a higher quality of assembly. Here, we assessed the quality of 1664 assembled plant and algal genomes using LAI and reported the results in a data repository called PlantLAI (https://bioinformatics.um6p.ma/PlantLAI). A total of 55 117 586 pseudomolecules/scaffolds with a total length of 988.11 gigabase-pairs were examined using the LAI workflow. A total of 46 583 551 accurate LTR-RTs were discovered, including 2 263 188 Copia, 2 933 052 Gypsy, and 1 387 311 elements of unknown superfamily. Only 1136 plant genomes were suitable for LAI calculation, with values ranging from 0 to 31.59. Based on the quality classification system, 476 diploid genomes were classified as draft, 472 as reference, and 135 as gold genomes. We also provide a free webtool to calculate the LAI of newly assembled genomes, with the option to save the result in the repository. The data repository is designed to fill in the gaps in the reported LAI of existing genomes, while the webtool is designed to help researchers calculate the LAI of their newly sequenced genomes.


Introduction
The rapid development of plant whole-genome sequencing projects has been driven by next-generation sequencing technologies that offer extremely high throughput at affordable costs (Hunt et al. 2013). In recent decades, such projects have aided in producing superior crop varieties, uncovering processes underpinning plant growth and development, and improving our knowledge of plant genome features such as complexity, size and architecture (Duitama et al. 2015; Cheng et al. 2016). However, current plant genome data derived from next-generation sequencing suffer from complications that make comprehensive chromosomal reconstruction difficult, such as read errors and large repeats in the genome (Mikheenko et al. 2018). Assessing the quality of genome assemblies is becoming increasingly important, both for assembly and reassembly and for the use of assembled genomes in downstream analysis (Yang et al. 2019). Because the precision of plant reference sequence data is critical to the interpretation of downstream functional genomic analysis, a measure of the quality of the genome sequences is needed.
Several methods have been developed to assess the quality of genome assemblies based on different concepts. These methods fall into two categories: length-based metrics and annotation-based metrics (Bradnam et al. 2013; Thrash et al. 2020). The N50 value is a length-based metric: it is the length of the shortest fragment among the longest fragments that together cover half the genome size. The contig N50 is commonly used to evaluate the quality of assemblies, especially their contiguity. Contiguity indicates how complete assembled genomes are and how many fragments or 'contigs' exist in the sequence. However, these values can be deceptive because the N50 value also increases when contigs are joined incorrectly. Moreover, they do not provide complete information about the completeness of genome assemblies (Gurevich et al. 2013; Hunt et al. 2013; Manchanda et al. 2020). The second category of metrics tests completeness by analyzing expected genome content, such as Benchmarking Universal Single-Copy Orthologs (BUSCO) and the LTR Assembly Index (LAI; Simão et al. 2015; Waterhouse et al. 2018). BUSCO is commonly used to assess the absence or presence of multiple highly conserved orthologous genes (Simão et al. 2015). However, most recently assembled genomes as well as draft genomes have high BUSCO values, so a high BUSCO value alone is not sufficient to demonstrate genome completeness. While BUSCO can only assess gene space, LAI is efficient in estimating genome completeness in more repetitive genome regions by calculating the percentage of intact LTR retrotransposon (LTR-RT) sequences (Feron and Waterhouse 2022). Therefore, LAI is an efficient metric for analyzing plant genome assemblies, which are often rich in repeats. Improvements in sequencing technologies that produce long reads have recently led to a remarkable increase in the coverage of repetitive regions of plant genomes. Consequently, LAI measurements have become extremely important.
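The behaviour of N50 described above can be illustrated with a minimal Python sketch. It assumes that N50 is computed against the total assembly length (using an estimated genome size instead gives the related NG50), and the function name `n50` and the toy contig lengths are illustrative only, not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Assumption: N50 is taken relative to the summed length of the
    fragments supplied, not to an independent genome size estimate.
    """
    if not lengths:
        raise ValueError("at least one fragment length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first fragment that brings the running sum to half of the
        # total is the shortest fragment needed to cover that half.
        if running >= half_total:
            return length


# Toy assembly: seven contigs, total length 300.
contigs = [80, 70, 50, 40, 30, 20, 10]

# The same assembly after wrongly joining the contigs of length 50 and 40.
misjoined = [90, 80, 70, 30, 20, 10]

print(n50(contigs), n50(misjoined))
```

In this toy case the incorrect join raises N50 although the assembly has become less accurate, which is why N50 alone is not a reliable indicator of quality.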
The dynamics of whole-genome duplication and transposable elements (TE) are the main mechanisms responsible for the wide diversity of genome sizes in plants (Leebens-Mack et al. 2019). Within a given ploidy level, plant genome size and TE content are often linearly related (Lee and Kim 2014), so plant genome size varies mainly due to TEs, which makes plant genomes much more complicated than vertebrate genomes (Jiao and Schneeberger 2017). TE content ranges from 3 % to over 85 % of the genome, and genome size is positively correlated with TE content (Kress et al. 2022). LTR-RTs are a class of TE scattered throughout most plant genomes and range in size from 4 to 20 kb (Mokhtar et al. 2021a, 2023). Final plant genome assemblies contain more intact LTR-RT elements than draft genomes, supporting the use of LAI as a measure of genome sequence quality and completeness. Here we provide a large-scale assessment of the quality of the assembled genomes of 1664 plant and algal species using the LAI values and report the results in a data repository called PlantLAI, available at https://bioinformatics.um6p.ma/PlantLAI. We also provide a free web tool to calculate the LAI value of newly assembled plant genomes. The data repository is intended to fill in the gaps in the reported LAI values of existing genomes, while the web tool is intended to help researchers calculate the LAI values of their newly sequenced genomes. Today, our understanding of plant biology depends heavily on genomic data, for example, in identifying genomic regions that control plant-microbe interactions (Eid et al. 2021), exploring variations in gene expression under environmental stress (Omar et al. 2021), genome-wide identification studies (Atia et al. 2017; Ahmed et al. 2021) and clarifying the effects of genomic diversity among plant species (Mokhtar et al. 2021b). Providing a measure of plant genome quality will improve all downstream applications that rely on plant genome data.

Materials and methods
A total of 1664 plant and algal genome sequences were retrieved from the NCBI database, including 1509 land plant genomes and 155 algal genomes. These genomes represent 704 plant species and 129 algal species. Of the 1509 land plants, 1456 are diploid genomes and 53 are polyploid genomes. For polyploid genomes, only the sequences assigned to one of their sub-genomes were analyzed. Some polyploids have more than one genome sequence in the NCBI database; in these cases, we considered only the sequences assigned to their sub-genomes. The species names, taxonomic groups, BioSample, BioProject, GenBank assembly accession, assembly level, genome size, and number of scaffolds/chromosomes of all genomes are provided in Supporting Information-Table S1.
The first step of the PlantLAI workflow is to detect LTR-RT candidates with LTRharvest (Ellinghaus et al. 2008) and LTR_FINDER_parallel (Ou and Jiang 2019). The resulting full-length LTR-RT candidates are combined and passed to LTR_retriever, which identifies the intact LTR-RTs using BLAST+ (Camacho et al. 2009), HMMER (Wheeler and Eddy 2013), RepeatMasker (Smit et al. 2015), CD-HIT (Fu et al. 2012) and Tandem Repeats Finder (TRF) (Benson 1999). The intact LTR-RTs and the RepeatMasker annotation are then passed to the LAI program to compute the LAI values (Fig. 1).
The LAI database and webserver use PHP 7.4.3, MongoDB 6.0, Apache 2.4 and Linux 5.4.0-89-generic x86_64. The server has 16 CPU cores, 32 GB RAM and a 10-TB hard disk. HTCondor 9.5 was used to manage and control the submitted tasks and jobs. The LAI online workflow includes the steps of receiving data from users, sending the data to the high-performance computer (HPC), receiving the results and preparing them for download through the web interface (Fig. 2). In-house scripts are used to transfer data between the server and the cloud. The jobs are queued and run on Toubkal (POWEREDGE C6420, CRC-STACKHPC, XEON PLATINUM 8276L 28C 2.2GHZ, MELLANOX INFINIBAND HDR100), Africa's fastest supercomputer as of August 2022 (Top500.org 2022).
LTRharvest (Ellinghaus et al. 2008), LTR_FINDER_parallel (Ou and Jiang 2019) and LTR_retriever were used for LTR-RT identification, while the LAI program was used to estimate the LTR assembly index. The parameter settings for LTR_FINDER_parallel were as follows: -seq genome -threads 56 -harvest_out -size 1000000 -time 300. Parameter settings for LTR_FINDER (Xu and Wang 2007): -threads 56 -gt genometools -size 1000000 -time 300. For LTRharvest, the parameters in the parallel version were '-minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1 -similar 85 -vic 10 -seed 20 -seqids yes'. LTR_retriever parameters were as follows: -genome genome -inharvest genome.rawLTR.scn -threads 56. For the LTR assembly index, LAI beta3.2 was used with the command line (LAI -genome genome -intact genome.pass.list -all genome.out -t 56 -q -blast) for diploid genomes. The parameter (-mono chromosomes.ids) was added for polyploid genomes.
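The settings listed above can be gathered into argument lists, for example to drive the tools from a script. The following Python sketch is only an illustration: the function name `build_commands`, the dictionary keys, and the convention that intermediate files are named by appending a suffix to the genome path are assumptions, and only flags quoted in the text are used. LTRharvest is left out because the text gives its parameters but not the complete invocation. Nothing is executed here; the lists could be handed to `subprocess.run` on a system where the tools are installed.

```python
def build_commands(genome, threads=56, mono_ids=None):
    """Assemble argument lists for the steps whose flags are quoted above.

    Assumptions (not from the study): intermediate files are named by
    appending '.rawLTR.scn', '.pass.list' and '.out' to `genome`.
    """
    t = str(threads)
    commands = {
        "ltr_finder": [
            "LTR_FINDER_parallel", "-seq", genome, "-threads", t,
            "-harvest_out", "-size", "1000000", "-time", "300",
        ],
        "ltr_retriever": [
            "LTR_retriever", "-genome", genome,
            "-inharvest", f"{genome}.rawLTR.scn", "-threads", t,
        ],
        "lai": [
            "LAI", "-genome", genome,
            "-intact", f"{genome}.pass.list",
            "-all", f"{genome}.out",
            "-t", t, "-q", "-blast",
        ],
    }
    # For polyploid genomes the text adds -mono with a list of sequence IDs.
    if mono_ids is not None:
        commands["lai"] += ["-mono", mono_ids]
    return commands


for name, argv in build_commands("genome", mono_ids="chromosomes.ids").items():
    print(name, " ".join(argv))
```

Keeping each command as a list of arguments rather than a single string avoids shell quoting problems if genome paths contain spaces.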

Results and Discussion
Recent advances in genome sequencing have led to an increase in the number of sequenced plant genomes (Mokhtar et al. 2021b). However, the presence of repetitive sequences complicates the assembly of a plant genome. Therefore, the growing number of assembled genomes in databases has increased the need to assess their quality. LAI was developed to assess the assembly of repetitive sequences and the completeness of assembly. To assess the LAI of an assembled genome sequence, intact LTR-RTs and total LTR-RTs should represent at least 0.1 % and 5 % of the total genome size, respectively.
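The two thresholds stated above can be written as a simple check. The Python sketch below is a hedged illustration, not the LAI program itself: the function names and example numbers are hypothetical, and the raw LAI is taken here as the length of intact LTR-RTs divided by the total LTR sequence length, multiplied by 100, which is our reading of the published definition. The adjusted LAI reported by the LAI program applies a further correction that is not reproduced here.

```python
def is_suitable_for_lai(intact_ltr_bp, total_ltr_bp, genome_bp,
                        min_intact_pct=0.1, min_total_pct=5.0):
    """Check the minimum LTR-RT content required before computing LAI."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    intact_pct = 100.0 * intact_ltr_bp / genome_bp
    total_pct = 100.0 * total_ltr_bp / genome_bp
    return intact_pct >= min_intact_pct and total_pct >= min_total_pct


def raw_lai(intact_ltr_bp, total_ltr_bp):
    """Raw LAI, assumed to be intact LTR-RT length / total LTR length x 100."""
    if total_ltr_bp <= 0:
        raise ValueError("total_ltr_bp must be positive")
    return 100.0 * intact_ltr_bp / total_ltr_bp


# Hypothetical 1 Gb genome with 2 % intact and 40 % total LTR content.
genome_bp = 1_000_000_000
intact_bp = 20_000_000
total_bp = 400_000_000

if is_suitable_for_lai(intact_bp, total_bp, genome_bp):
    print(raw_lai(intact_bp, total_bp))

# A genome with only 0.09 % intact LTR-RT content fails the first threshold.
print(is_suitable_for_lai(900_000, total_bp, genome_bp))
```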
In the current study, to estimate the LAI value of 1664 land plant and algal genomes, a total of 55 117 586 pseudomolecules/scaffolds with a total length of 988.11 gigabase pairs were examined using the LAI workflow (Fig. 1). For land plants, a total of 30 701 357 LTR-RT candidates were detected, including 6 583 551 LTR-RTs that passed the LTR_retriever filtering step and 24 117 806 false LTR-RT elements. Only LTR-RT candidates that passed the LTR_retriever filtering step were used for further analysis [see Supporting Information-Table S2]. Because LAI can be used to identify low-quality genomic regions, the LAI and raw LAI values of pseudomolecules/scaffolds and their fragments (3 Mb) were estimated. Due to the large amount of data, the values for whole genomes, pseudomolecules/scaffolds and their fragments for each genome can be retrieved from the PlantLAI search page. A genome classification system based on the LAI value has been proposed for assessing the assembly of repetitive and intergenic sequence space. Under this system, draft genomes have an LAI value of less than 10, reference genomes have an LAI value between 10 and 20, and gold-quality genomes have an LAI value greater than 20. Based on this quality classification system, diploid genomes were classified into 476 draft genomes, 472 reference genomes and 135 gold genomes, while polyploid sub-genomes were classified into 16 draft genomes, 98 reference genomes and 13 gold genomes. Polyploid genomes contain multiple sub-genomes that are not necessarily of the same quality ([see Supporting Information-Tables S2 and S3]; Fig. 3). Supporting Information-Figure S1 shows the histograms of the raw LAI and LAI values for the plant and algal genomes studied.
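The classification rule described above is easy to express in code. The following Python sketch uses the thresholds given in the text; the function name `classify_lai` is illustrative, and treating a value of exactly 10 or exactly 20 as 'reference' is an assumption about how the boundaries are handled. The example values are LAI values quoted elsewhere in this article.

```python
from collections import Counter


def classify_lai(lai):
    """Map an LAI value to the draft / reference / gold quality classes."""
    if lai < 0:
        raise ValueError("LAI cannot be negative")
    if lai < 10:
        return "draft"
    if lai <= 20:  # assumption: both boundary values count as reference
        return "reference"
    return "gold"


# LAI values quoted in the text for several genomes.
examples = {
    "Aegilops tauschii": 6.41,
    "Dichanthelium oligosanthes": 7.85,
    "Vicia sativa": 12.96,
    "Medicago truncatula": 21.2,
    "Trifolium pratense": 28.1,
}

classes = {species: classify_lai(value) for species, value in examples.items()}
print(classes)
print(Counter(classes.values()))
```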
Among the NCBI reference genomes, 15 assembled genomes of cereal crops belonging to the Poaceae family are included in the analysis. The evaluation revealed that the genomes of Zea mays (GCA_902167145.1), Sorghum bicolor (GCA_000003195.3), Oryza sativa (GCA_001433935.1), Setaria viridis (GCA_005286985.1) and Panicum hallii (GCA_002211085.2) can be classified as gold genomes because their LAI values range from 21 to 29.45. Since the LAI values of Aegilops tauschii (GCA_002575655.2) and Dichanthelium oligosanthes (GCA_001633215.2) were 6.41 and 7.85, respectively, these genomes were classified as draft genomes, while the remaining genomes studied were considered reference genomes. The Hordeum vulgare (GCA_904849725.1) and Zea mays genomes were the richest genomes in the Poaceae family in terms of their LTR-RT content, accounting for 83.8 % and 81.9 % of the total genome, respectively (Fig. 4). Since more intact LTR-RTs are detected in these instances, higher LAI values are associated with higher quality assembly of intergenic and repetitive sequence regions of the genome (Takei et al. 2021). Recently, the LAI method has been widely used to assess the quality of genome assembly of cereal crops, such as rice (Yang et al. 2022), maize and rye (Li et al. 2021). However, few reports have examined the quality of the reference genomes on which the assembly of these genomes is based, and which are sometimes used for comparative analysis. PlantLAI fills this gap and provides LAI values for several versions of plant and algal genomes.
Figure 2. LAI online workflow, including the steps of receiving the data from the users, sending the data to the HPC, receiving the results and displaying them on the web interface.
In addition, 19 NCBI reference genomes of legumes belonging to the Fabaceae family were evaluated. The analysis revealed that only the genomes of Trifolium pratense (GCA_020283565.1) and Medicago truncatula (GCA_003473485.2) could be classified as gold genomes, as their LAI values were 28.1 and 21.2, respectively. The genomes of Vigna unguiculata (GCA_004118075.2), Abrus precatorius (GCA_003935025.1), Prosopis alba (GCA_004799145.1) and Spatholobus suberectus (GCA_004329165.1) were classified as reference genomes, with LAI values ranging from 10.91 to 15.13. On the other hand, 13 genomes, including genomes of important legumes such as Glycine max (GCA_000004515.5), Phaseolus vulgaris (GCA_000499845.1), Vigna radiata (GCA_000741045.2) and Cicer arietinum (GCA_000331145.1), were classified as draft genomes. The genome of Arachis duranensis (GCA_000817695.3) was the richest genome of the Fabaceae family in terms of LTR-RTs, which accounted for 50.6 % of the total genome (Fig. 4). Recently, Xi et al. (2022) used LAI to assess the genome assembly of Vicia sativa, which was constructed using a mixture of long reads from the Oxford Nanopore sequencing technology and short reads from the Illumina sequencing technology. The LAI value was 12.96, which qualified the new genome as a reference genome.
The Solanaceae family contained 11 NCBI reference genomes among the evaluated genomes in this study, including three genomes from the genus Nicotiana, three genomes from the genus Capsicum and five genomes from the genus Solanum. The evaluation showed that all the NCBI reference genomes could be classified as draft genomes, as their LAI values ranged from 5.05 to 9.87. We could not perform the LAI analysis for the Solanum chilense (GCA_006013705.1) genome because the content of intact LTR-RT was 0.09 %, which is too low for accurate calculation of LAI. At least 0.1 % intact LTR-RT is required within the genome architecture. The genome of Nicotiana tabacum (GCA_000715135.1) was the richest genome of the Solanaceae family in terms of LTR-RTs, accounting for 75.07 % of the total genome (Fig. 4). Many previous studies have characterized LTR-RTs of Solanaceae family species such as Capsicum annuum (de Assis et al. 2020;Mokhtar et al. 2023), Solanum melongena (Barchi et al. 2019), Datura stramonium (De-la-Cruz et al. 2021) and Solanum lycopersicum (Paz et al. 2017;Mokhtar et al. 2023). In a recent study, the LAI method was used to verify the quality of de novo assembly of the genomes of wild tomatoes Solanum pimpinellifolium and Solanum lycopersicum var. cerasiforme which were sequenced with the PacBio Sequel system (Takei et al. 2021). The two genomes were categorized as reference because their LAI values were 14.18 and 13.10, respectively. In another recent study, Oxford Nanopore's long-read sequencing technology was used to obtain the de novo genome sequence of a potato. The new genome was categorized as a reference genome because its LAI value was 13.56 (Pham et al. 2020).

PlantLAI online data repository
Data generated by the LAI workflow of 1664 land plants and algal genomes were used to create an interactive web interface for the LAI. The Plant LAI (PlantLAI) website is accessible through the public link (https://bioinformatics.um6p.ma/PlantLAI). The PlantLAI data can be easily searched and downloaded from the website.
The PlantLAI search menu is divided into three individual search pages. The first page allows searching the LTR Assembly Index for selected diploid plant species. The second page allows searching for polyploid genomes, and the third page is designated for searching algal genomes. The LTR Assembly Index search page provides researchers with the LAI value for the entire genome of the selected species. In addition, a search option for a specific chromosome/scaffold is available. All results are displayed on the same page with information about LAI, including chromosome/scaffold, start and end of the analyzed sequence, percentage of intact LTR, percentage of total LTR, raw LAI, and LAI values (Fig. 5).
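The start and end coordinates reported for each analyzed sequence correspond to fragments of a pseudomolecule or scaffold. As a rough illustration of how a sequence might be divided into 3 Mb fragments, the Python sketch below generates window coordinates. The function name, the 1-based inclusive coordinates and the default of non-overlapping windows are assumptions for illustration; the step size actually used by the LAI program is not stated in this article.

```python
def fragment_windows(seq_len, size=3_000_000, step=None):
    """Return (start, end) pairs, 1-based and inclusive, covering a sequence.

    Assumption: windows do not overlap unless a smaller `step` is given.
    """
    if seq_len <= 0 or size <= 0:
        raise ValueError("seq_len and size must be positive")
    step = size if step is None else step
    if step <= 0:
        raise ValueError("step must be positive")
    windows = []
    start = 0
    while start < seq_len:
        end = min(start + size, seq_len)
        windows.append((start + 1, end))
        if end == seq_len:
            break  # the last window reached the end of the sequence
        start += step
    return windows


# A hypothetical 7 Mb scaffold is split into two full windows and a remainder.
for window in fragment_windows(7_000_000):
    print(window)
```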
The download page can be accessed from the top menu of any page. The page provides the option to download the bulk data as zipped files, including the RepeatMasker output, the LTR-RTs that passed the LTR_retriever filtering step in BED format, the GFF3 file for the identified intact LTR-RTs, the LAI table of the whole genome, pseudomolecules, scaffolds and their fragments, and the table of identified intact LTR-RTs.

The LAI webserver
The web interface of the LAI pipeline tool is designed to collect the data required for the LAI workflow. The pipeline accepts whole-genome sequences in FASTA format. Users can upload a genome file from a local computer or provide an NCBI FTP link. Users are also prompted to set the evolution rate and select the closest species for tRNA genes. The online LAI workflow consists of the identification of LTR-RTs and the calculation of LAI using LTRharvest, LTR_FINDER, LTR_retriever and the LAI program (Figs 1 and 2).
The LAI webserver generates a series of files listing the LAI values and the identified intact and non-intact LTR-RTs. The results table shows key details such as LTR-RT superfamily, LTR-RT insertion age, percentage of intact LTR-RT, percentage of total LTR-RT, raw LAI and LAI values. The generated files include LTR-RTs identified by LTR_FINDER, LTRharvest and LTR_retriever in FASTA, BED and GFF3 formats. The output also includes all LTRs identified by the RepeatMasker software. The values of raw LAI and LAI for the whole genome, pseudomolecules, scaffolds and their fragments (3 Mb in size) are also made available for download and can be sent by email if an address is provided. The user can choose to save the results of the run, which will be added to the PlantLAI database after validation.

Conclusion
The LAI method has recently been widely used to assess the quality of genome assembly after whole genome sequencing because a higher value of LAI is associated with higher quality assembly of the intergenic and repetitive sequence of the genome. Here, we assessed the quality of the assembled genomes of 1664 land plant and algal genomes using the LAI values and reported the results in a data repository called PlantLAI. We also provide a free web tool to calculate the LAI value of newly assembled plant genomes. The purpose of the data repository is to provide LAI values of existing genomes, while the web tool should allow researchers to calculate LAI values of their newly sequenced genomes. PlantLAI will be continuously updated with LAI values of newly deposited genomes as well as any updated reference genomes.

Supporting Information
The following additional information is available in the online version of this article -
Figure S1. The histograms of the raw LAI and LAI values for the plant and algal genomes studied.
Table S1. The species name, taxonomy groups, GenBank assembly accession, BioSample id, BioProject id, Assembly level, genome size and number of scaffolds and chromosomes of all studied genomes.
Table S2. LAI values of all studied genomes.
Table S3. The LAI values for 136 genomes with LAI values higher than 20 (gold quality).